Python Objects Under the Hood
Why Does sys.getsizeof(True) Return 28?
Open a Python REPL and run this:
import sys
print(sys.getsizeof(True)) # 28
print(sys.getsizeof(False)) # 28
print(sys.getsizeof(1)) # 28
print(sys.getsizeof(0)) # 24
print(sys.getsizeof(-1)) # 28
print(sys.getsizeof(2**30)) # 32
print(sys.getsizeof(2**60)) # 36
A Python boolean takes 28 bytes. A C bool takes 1 byte. That is a 28x overhead. A Python integer 0 takes 24 bytes. True takes 28. 2**30 takes 32. The size grows with the value.
This is not an accident or a flaw. It is a direct consequence of Python's object model. Every Python value - not just numbers, but every string, list, function, class instance - is a heap-allocated C struct with a reference count, a type pointer, and type-specific fields. Understanding this model explains every memory measurement you will ever take on a Python program.
PyObject: The Universal Header
Every Python object begins with the same 16-byte header defined in Include/object.h:
typedef struct _object {
Py_ssize_t ob_refcnt; // 8 bytes: reference count
PyTypeObject *ob_type; // 8 bytes: pointer to type object
} PyObject;
On a 64-bit system, at the address returned by id(obj):
Offset Size Field Description
──────────────────────────────────────────────────────
0 8 ob_refcnt How many Python names point to this object
8 8 ob_type Pointer to the PyTypeObject for this type
16 ? (type-specific fields follow)
The id() function in Python returns the memory address of the object - specifically the address of its ob_refcnt field. You can verify this with ctypes:
import sys
import ctypes
x = 42
print(hex(id(x))) # e.g. 0x10a3f2eb8
# Read ob_refcnt directly from C memory
refcnt = ctypes.c_ssize_t.from_address(id(x)).value
print(f"ob_refcnt = {refcnt}")
# Read ob_type pointer
ob_type = ctypes.c_void_p.from_address(id(x) + 8).value
print(f"ob_type addr = {hex(ob_type)}")
print(f"id(int) = {hex(id(int))}") # Should match - 42 is an int
Concrete Object Layouts
PyLongObject - Python int
In Python 3.12+, CPython introduced a compact integer representation. Small integers (fitting in 30 bits) store the value inline in the object header:
// Python 3.12+ compact integer representation (Include/cpython/longintrepr.h)
typedef struct _PyLongObject {
PyObject ob_base; // 16 bytes: refcnt + type pointer
// lv_tag encodes both the number of digits and sign information:
uintptr_t long_value; // For single-digit ints: value packed here
// For large integers, digits[] array follows:
// digit ob_digit[1]; // Variable-length array of 30-bit "digits"
} PyLongObject;
This is why the size grows with the value:
| Value | Digits needed | sys.getsizeof() |
|---|---|---|
| 0 | 0 (special case) | 24 bytes |
| 1–2^30 | 1 | 28 bytes |
| 2^30–2^60 | 2 | 32 bytes |
| 2^60–2^90 | 3 | 36 bytes |
Python integers are arbitrary precision. A 1000-digit integer works fine - it just takes (24 + 4 * ceiling(digits/30)) bytes.
PyFloatObject - Python float
typedef struct {
PyObject ob_base; // 16 bytes: refcnt + type
double ob_fval; // 8 bytes: IEEE 754 double
} PyFloatObject;
// Total: 24 bytes
Python floats are always 64-bit doubles. There is no 32-bit float in core Python (use NumPy float32 for that).
import sys
print(sys.getsizeof(3.14)) # 24
print(sys.getsizeof(0.0)) # 24 - same, always 24
PyListObject - Python list
typedef struct {
PyObject ob_base; // 16 bytes: refcnt + type
Py_ssize_t ob_size; // 8 bytes: current length
PyObject **ob_item; // 8 bytes: pointer to heap-allocated array of PyObject*
Py_ssize_t allocated; // 8 bytes: allocated capacity (>= ob_size)
} PyListObject;
// Fixed struct: 40 bytes
// Plus: ob_item array = allocated * 8 bytes on the heap (separate allocation)
import sys
lst = []
print(sys.getsizeof(lst)) # 56 (includes ob_item ptr, ob_size, allocated)
lst.append(1)
print(sys.getsizeof(lst)) # 88 (ob_item array now has 4 slots = 32 bytes)
# Note: list over-allocates to amortise append cost
lst2 = [0] * 1000
print(sys.getsizeof(lst2)) # 8056 = 56 + 1000 * 8
# But: each element is also a separate 28-byte PyLongObject!
# True total: 8056 + 1000 * 28 = 36,056 bytes (though small ints are cached)
The critical diagram:
PyListObject [lst] Heap: ob_item array
┌─────────────┐ ┌──────────────┐
│ ob_refcnt=1 │ │ ptr → int(1) │──→ PyLongObject(1)
│ ob_type=list│ │ ptr → int(2) │──→ PyLongObject(2)
│ ob_size=3 │──────────────────── │ ptr → int(3) │──→ PyLongObject(3)
│ ob_item ────┘ │ ptr → NULL │ (over-allocated slot)
│ allocated=4 │ └──────────────┘
└─────────────┘
Contrast with a C int array:
int arr[3] = {1, 2, 3};
┌────┬────┬────┐
│ 1 │ 2 │ 3 │ (12 bytes contiguous, no pointers, no header)
└────┴────┴────┘
A Python list of integers has two levels of pointer indirection (list → ob_item array → each integer object), while a C array is a single block of contiguous memory. This is why NumPy arrays are so much faster for numeric computation - they store values directly, like C arrays.
PyDictObject - Python dict
Python's dict is a hash table with open addressing. Since Python 3.6 it maintains insertion order; since 3.7 that order is guaranteed by the language spec.
// Simplified (actual struct is more complex with split-table optimisation)
typedef struct {
PyObject ob_base; // 16 bytes
Py_ssize_t ma_used; // 8 bytes: number of items
uint64_t ma_version_tag; // 8 bytes: version counter (for caches)
PyDictKeysObject *ma_keys; // 8 bytes: pointer to keys/values table
PyObject **ma_values; // 8 bytes: pointer to values (split dict)
} PyDictObject;
import sys
print(sys.getsizeof({})) # 64 bytes for empty dict
print(sys.getsizeof({'a': 1})) # 232 bytes - initial hash table allocation
print(sys.getsizeof({'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5})) # 360 bytes
Reference Counting Mechanics
CPython uses reference counting as its primary memory management strategy. Every PyObject has an ob_refcnt field. Rules:
- When a new reference to an object is created (assignment, argument passing, container insertion), call
Py_INCREF(obj)- atomically incrementsob_refcnt - When a reference is destroyed (variable goes out of scope, container element is removed), call
Py_DECREF(obj)- decrementsob_refcnt; if it hits 0, callsobj->ob_type->tp_dealloc(obj)immediately
The macros from Include/object.h:
static inline void Py_INCREF(PyObject *op) {
op->ob_refcnt++;
}
static inline void Py_DECREF(PyObject *op) {
if (--op->ob_refcnt == 0) {
_Py_Dealloc(op); // calls tp_dealloc
}
}
// Safe version - does nothing if op is NULL
static inline void Py_XDECREF(PyObject *op) {
if (op != NULL) {
Py_DECREF(op);
}
}
You can observe reference counting directly:
import sys
import ctypes
# Create an object and watch its refcount
lst = [1, 2, 3]
print(sys.getrefcount(lst)) # 2: one for 'lst', one for getrefcount() call
# Add another reference
alias = lst
print(sys.getrefcount(lst)) # 3: lst, alias, getrefcount
# Remove reference
del alias
print(sys.getrefcount(lst)) # 2: back to lst + getrefcount
# Put it in a container
container = [lst]
print(sys.getrefcount(lst)) # 3: lst, container[0], getrefcount
# Remove from container
container.clear()
print(sys.getrefcount(lst)) # 2: back to lst + getrefcount
# Function call creates a temporary reference
def check_refcount(obj):
# Inside the call, obj parameter is another reference
return sys.getrefcount(obj)
print(check_refcount(lst)) # 4: lst, check_refcount param, getrefcount, call stack
When Reference Counting Fails: Cycles
Reference counting alone cannot free cyclic garbage:
import gc
gc.disable() # Disable cyclic GC so we can observe the cycle
# Create a cycle: a and b reference each other
a = {}
b = {}
a['partner'] = b
b['partner'] = a
# Now del both names
del a
del b
# At this point:
# - 'a' dict: ob_refcnt = 1 (still referenced by b['partner'])
# - 'b' dict: ob_refcnt = 1 (still referenced by a['partner'])
# Neither hits 0, neither is freed
# This is a memory leak if gc.collect() is never called
# Force cyclic GC to collect
collected = gc.collect()
print(f"Collected {collected} unreachable objects")
# Collected 2 unreachable objects - the two dicts are now freed
gc.enable() # Re-enable
The cyclic GC (covered in Lesson 04) handles this. It is the reason pure reference counting is not enough.
The Small Integer Cache
CPython pre-allocates PyLongObject instances for every integer from -5 to 256. These objects are created at interpreter startup and are never freed. Any expression that produces an integer in this range returns a pointer to the pre-existing cached object rather than allocating a new one.
# Integers in [-5, 256] are cached singletons
a = 100
b = 100
print(a is b) # True - same object, same pointer
a = 257
b = 257
print(a is b) # False - different objects, different allocations
# The cache boundary is exactly 256
print(256 is 256) # True
print(257 is 257) # False (outside cache) - though CPython may optimise
# adjacent literal comparisons at compile time
# Verify with id()
print(id(0) == id(0)) # True - always the same object
print(id(256) == id(256)) # True
print(id(257) == id(257)) # Likely False in separate expressions
# The lower boundary
print(id(-5) == id(-5)) # True
print(id(-6) == id(-6)) # Likely False
The cache exists because small integers are used constantly (loop counters, indices, error codes). Without the cache, for i in range(1000000) would allocate and deallocate 1,000,000 PyLongObject structs.
You can see the cache in the CPython source in Objects/longobject.c:
#define NSMALLPOSINTS 257 // Cache 0 to 256
#define NSMALLNEGINTS 5 // Cache -5 to -1
// The cache array (simplified)
static PyLongObject small_ints[NSMALLNEGINTS + NSMALLPOSINTS];
Important production implication: using is to compare integers is a bug. a is b tests object identity (same pointer), not equality. It happens to work for small integers due to the cache, but fails unpredictably for larger values:
# Bug: using 'is' to compare integers
def validate_status(code):
return code is 200 # SyntaxWarning in 3.8+, wrong in all versions
# Correct: use ==
def validate_status_correct(code):
return code == 200
String Interning
CPython interns (deduplicates) certain strings - storing only one copy in memory and reusing the same object for all references to that string value.
Which strings are interned automatically?
- String literals that look like identifiers (only letters, digits, underscores) - interned at compile time
- Keys in dictionaries created with
{...}literal syntax - interned at compile time - Strings passed through
sys.intern()- explicitly interned
import sys
# Automatically interned: identifier-like strings
a = "hello"
b = "hello"
print(a is b) # True - same interned object
# Not automatically interned: contains space or special chars
a = "hello world"
b = "hello world"
print(a is b) # False (usually) - separate allocations
# Force interning with sys.intern
a = sys.intern("hello world")
b = sys.intern("hello world")
print(a is b) # True - both are now the same interned object
# Why this matters: dict key lookup
# CPython's dict uses pointer equality before hash equality for string keys
# Interned strings skip hash computation on lookup - pure pointer comparison
# Benchmark: interned vs non-interned as dict keys
import timeit
interned_key = sys.intern("user_id")
non_interned_key = "user_id" # Will likely be interned anyway for identifier chars
d = {interned_key: 42}
t1 = timeit.timeit(lambda: d[interned_key], number=1_000_000)
t2 = timeit.timeit(lambda: d[non_interned_key], number=1_000_000)
print(f"Interned key: {t1:.3f}s")
print(f"Non-interned key: {t2:.3f}s")
# For identifier-like strings the difference is minimal (both are interned)
# For computed strings, sys.intern() can meaningfully speed up dict lookups
String interning is especially important when using strings as dict keys in hot paths. The __dict__ of every Python object uses string keys for attribute names - those names are always interned, which is why attribute lookup obj.name is faster than obj.__dict__["some computed key"].
Type Objects: PyTypeObject as a Vtable
The ob_type pointer in every PyObject points to a PyTypeObject. This is Python's equivalent of a C++ vtable - a large struct of function pointers defining how objects of that type behave.
From Include/object.h (highly simplified - the real struct has ~50 fields):
typedef struct _typeobject {
PyObject_VAR_HEAD // The type object is itself a Python object
const char *tp_name; // "int", "list", "MyClass", etc.
Py_ssize_t tp_basicsize; // Size of instances (for PyObject_New)
Py_ssize_t tp_itemsize; // For variable-size objects
// Fundamental operations
destructor tp_dealloc; // Called when ob_refcnt hits 0
reprfunc tp_repr; // __repr__
hashfunc tp_hash; // __hash__
ternaryfunc tp_call; // __call__
reprfunc tp_str; // __str__
// Number protocol (arithmetic operators)
PyNumberMethods *tp_as_number; // nb_add, nb_subtract, nb_multiply, ...
// Sequence protocol (indexing, slicing)
PySequenceMethods *tp_as_sequence; // sq_length, sq_item, sq_contains, ...
// Mapping protocol (dict-like)
PyMappingMethods *tp_as_mapping; // mp_length, mp_subscript, mp_ass_subscript
// Object model
getattrofunc tp_getattro; // __getattr__ / attribute lookup
setattrofunc tp_setattro; // __setattr__ / attribute setting
initproc tp_init; // __init__
allocfunc tp_alloc; // Memory allocation
newfunc tp_new; // __new__
freefunc tp_free; // Memory deallocation (lower level)
// Garbage collection
traverseproc tp_traverse; // Visit referenced objects (for cyclic GC)
inquiry tp_clear; // Clear references (break cycles)
// Rich comparison
richcmpfunc tp_richcompare; // __eq__, __lt__, __gt__, etc.
// Descriptors
descrgetfunc tp_descr_get; // __get__
descrsetfunc tp_descr_set; // __set__
// Method resolution
PyObject *tp_bases; // Tuple of base types
PyObject *tp_mro; // Tuple of types in MRO order
PyObject *tp_dict; // Type's __dict__ (methods, class vars)
} PyTypeObject;
When you call len(lst), CPython does not look up a method in a dict. It calls lst->ob_type->tp_as_sequence->sq_length(lst) - a direct C function pointer call. This is why built-in operations on built-in types are fast.
# Observe the type object system from Python
lst = [1, 2, 3]
# ob_type is accessible as type()
print(type(lst)) # <class 'list'>
print(type(lst) is list) # True
# The type object has a dict with all methods
print(type(lst).__dict__.keys())
# dict_keys(['__repr__', '__hash__', '__getattribute__', '__lt__', '__le__',
# '__eq__', '__ne__', '__gt__', '__ge__', '__iter__', '__len__',
# '__getitem__', '__setitem__', '__delitem__', '__add__', ...])
# Type hierarchy
print(type(list)) # <class 'type'> - list's type is 'type'
print(type(type)) # <class 'type'> - type's type is itself
print(isinstance(list, type)) # True
print(isinstance(int, type)) # True
How isinstance() Works
isinstance(obj, cls) does not compare type pointers directly. It traverses the MRO (Method Resolution Order) checking if any base type matches:
# isinstance uses MRO traversal in C (Objects/abstract.c)
class Animal:
pass
class Dog(Animal):
pass
d = Dog()
print(isinstance(d, Dog)) # True - exact type match
print(isinstance(d, Animal)) # True - base class in MRO
print(isinstance(d, object)) # True - everything inherits from object
# MRO is precomputed and stored in tp_mro
print(Dog.__mro__)
# (<class '__main__.Dog'>, <class '__main__.Animal'>, <class 'object'>)
# isinstance() is O(depth of MRO) - typically O(1) for flat hierarchies
# For deep class hierarchies, isinstance() gets proportionally slower
The Cost of Attribute Lookup
When you write obj.name, Python does not simply look up 'name' in a dict. The full lookup path is:
obj.name lookup:
1. type(obj).__mro__ - iterate base classes
2. For each class C in MRO:
a. Check C.__dict__ for 'name'
b. If found and it is a data descriptor (has __get__ and __set__):
→ call descriptor.__get__(obj, type(obj)) ← HIGHEST priority
3. Check obj.__dict__ for 'name' ← instance attributes
4. For each class C in MRO (again):
a. Check C.__dict__ for 'name'
b. If found and it is a non-data descriptor (has __get__ only):
→ call descriptor.__get__(obj, type(obj))
5. Return the value (or AttributeError if nothing found)
This is the full descriptor protocol. It explains why @property (a data descriptor) takes precedence over instance __dict__ entries.
# The __dict__ behind every object
class Point:
def __init__(self, x, y):
self.x = x
self.y = y
p = Point(3, 4)
print(p.__dict__) # {'x': 3, 'y': 4}
# Each attribute lookup goes through the protocol above
# For a simple instance attribute with no descriptor: ~3 dict lookups
import timeit
# Attribute access vs dict access
t1 = timeit.timeit('p.x', setup='class P:\n def __init__(self): self.x=1\np=P()', number=10_000_000)
t2 = timeit.timeit('d["x"]', setup='d={"x":1}', number=10_000_000)
print(f"Attribute access: {t1:.3f}s")
print(f"Dict access: {t2:.3f}s")
# Attribute access is ~20-30% slower due to descriptor protocol overhead
__slots__: Eliminating __dict__
By default, every Python object has a __dict__ - a full dict object allocated on the heap. For a class with 3 attributes, that dict takes 200+ bytes just for its structure, on top of the object's fixed fields.
__slots__ replaces the __dict__ with fixed C-level slots:
import sys
class WithDict:
def __init__(self, x, y, z):
self.x = x
self.y = y
self.z = z
class WithSlots:
__slots__ = ('x', 'y', 'z')
def __init__(self, x, y, z):
self.x = x
self.y = y
self.z = z
d = WithDict(1, 2, 3)
s = WithSlots(1, 2, 3)
print(f"WithDict size: {sys.getsizeof(d)} bytes") # ~48 bytes
print(f"WithSlots size: {sys.getsizeof(s)} bytes") # ~40 bytes (no __dict__)
# The real difference is the __dict__ object itself
print(f"d.__dict__ size: {sys.getsizeof(d.__dict__)} bytes") # ~232 bytes
# WithSlots: NO __dict__ - attributes are stored as direct C struct fields
# Total memory per object:
# WithDict: ~280 bytes (object + __dict__ + per-item overhead)
# WithSlots: ~40 bytes
# Speed: slot access is faster (direct C struct field) vs dict lookup
import timeit
d_obj = WithDict(1, 2, 3)
s_obj = WithSlots(1, 2, 3)
t1 = timeit.timeit('obj.x', globals={'obj': d_obj}, number=10_000_000)
t2 = timeit.timeit('obj.x', globals={'obj': s_obj}, number=10_000_000)
print(f"__dict__ attr access: {t1:.3f}s")
print(f"__slots__ attr access: {t2:.3f}s")
# __slots__ is typically 15-30% faster for attribute access
Use __slots__ when you create large numbers of instances of a class (e.g., nodes in a tree, entries in a cache, data records) and the attribute set is fixed at class definition time.
Memory Layout Comparison: Python List vs C Array
The core reason Python is slow for numeric computation:
Python list [1, 2, 3]:
PyListObject (40 bytes on heap):
┌─────────────────┐
│ ob_refcnt = 1 │ 8 bytes
│ ob_type = *list │ 8 bytes → PyTypeObject for list
│ ob_size = 3 │ 8 bytes
│ ob_item = *─────┼──────────────────┐
│ allocated = 4 │ 8 bytes │
└─────────────────┘ ▼
ob_item array (32 bytes on heap):
┌──────────┬──────────┬──────────┬──────────┐
│ptr → int1│ptr → int2│ptr → int3│NULL │
└──────────┴──────────┴──────────┴──────────┘
│ │ │
▼ ▼ ▼
PyLongObject PyLongObject PyLongObject
(28 bytes) (28 bytes) (28 bytes)
{refcnt, {refcnt, {refcnt,
type, type, type,
value=1} value=2} value=3}
Total: 40 + 32 + 3×28 = 156 bytes for 3 integers
Cache behaviour: 3 pointer dereferences to read one int value
C int array int arr[3] = {1, 2, 3}:
┌────┬────┬────┐
│ 1 │ 2 │ 3 │ 12 bytes (contiguous), no heap allocation
└────┴────┴────┘
Cache behaviour: sequential, prefetcher-friendly, one load per element
NumPy array np.array([1, 2, 3], dtype=np.int64):
┌───────────────────────────────────┐
│ PyObject header + ndarray fields │ ~120 bytes for the array object
└───────────────────────────────────┘
│
▼ (single heap allocation)
┌──────────────┬──────────────┬──────────────┐
│ int64(1) │ int64(2) │ int64(3) │ 24 bytes contiguous
└──────────────┴──────────────┴──────────────┘
Cache behaviour: same as C array - sequential, prefetcher-friendly
This layout difference is fundamental. NumPy is fast not primarily because of algorithmic superiority but because its data is stored like a C array, enabling vectorisation (SIMD instructions) and CPU cache efficiency.
Interview Q&A
Q1: Why does sys.getsizeof(True) return 28 bytes? What is the memory layout?
True is a PyBoolObject, which is a subtype of PyLongObject. The 28 bytes break down as: 8 bytes for ob_refcnt (the reference count), 8 bytes for ob_type (pointer to the bool type object), and then the integer value fields. In CPython 3.12+, small integers use a compact representation where the value is encoded in the lv_tag field and a single ob_digit (4 bytes). The sys.getsizeof result includes the fixed header (16 bytes) plus the compact integer storage (approximately 12 bytes for the tag and one digit). The reason it is not 1 byte like a C bool is that every Python object must carry its reference count and type pointer regardless of what it stores - Python's dynamic type system has a fixed per-object overhead of at least 16 bytes.
Q2: What is the small integer cache? When does it cause bugs?
CPython pre-allocates PyLongObject instances for every integer from -5 to 256 at interpreter startup. Integers in this range are singletons - a = 100; b = 100; a is b is True because both a and b point to the same pre-allocated object. Integers outside this range get fresh allocations: a = 300; b = 300; a is b is False.
The bug it causes: using is instead of == to compare integers. Code like if status_code is 200 appears to work during development because HTTP status codes are often small integers, but fails in production when the status code is 304, 404, or 500. The correct comparison is always == for value equality. Python 3.8+ issues a SyntaxWarning for is comparisons with integer literals precisely to catch this bug.
Q3: What is the descriptor protocol? How does @property use it?
A descriptor is any object with a __get__ method (and optionally __set__ and __delete__). When attribute lookup finds an object with __get__ in a class's __dict__ (not the instance's __dict__), it calls descriptor.__get__(instance, type) instead of returning the raw object.
@property creates a property object - a data descriptor (it has both __get__ and __set__) - and stores it in the class's __dict__ under the attribute name. When obj.x is accessed and x is a property in type(obj).__dict__, the lookup protocol calls property.__get__(obj, type(obj)), which calls the getter function. Because property is a data descriptor, it takes precedence over instance __dict__ entries - obj.__dict__['x'] = 5 does not bypass the property getter.
classmethod, staticmethod, __slots__, and super() all use the descriptor protocol internally.
Q4: What are __slots__ and when should you use them?
By default, every Python instance has a __dict__ - a dictionary allocated on the heap storing the instance's attributes. __dict__ typically consumes 200-400 bytes even when empty, due to the dict's internal hash table structure.
__slots__ is a class variable that replaces __dict__ with fixed C-level storage. When you declare __slots__ = ('x', 'y'), CPython creates member_descriptor objects (a type of descriptor) in the class's __dict__ for x and y, and each instance stores those values as direct fields in the object struct rather than in a dict.
Benefits: (1) memory reduction of 200-400 bytes per instance - critical when creating millions of objects; (2) ~15-30% faster attribute access because it is a direct C struct field access instead of a dict lookup; (3) prevents accidental creation of new attributes (raises AttributeError).
Limitations: cannot add attributes not listed in __slots__; __weakref__ is not automatically available; subclassing works but requires care if the subclass also uses __slots__.
Use __slots__ for data classes, tree/graph nodes, cache entries - any class where you create large numbers of instances with a fixed, known set of attributes.
Q5: How does Python determine which __add__ to call when you write a + b?
The BINARY_OP opcode calls binary_op() in ceval.c, which calls PyNumber_Add(). The logic: (1) get the type of a as ta = type(a), (2) get nb_add = ta->tp_as_number->nb_add, (3) if nb_add is not NULL, call result = nb_add(a, b). If the result is NotImplemented, check whether type(b) is a subtype of type(a) - if so, try type(b).__radd__ first. If not a subtype, try type(a).__add__ first, then type(b).__radd__ as fallback.
For int + int, this resolves immediately to long_add() in Objects/longobject.c. For user-defined classes, it calls through the tp_as_number slot which either points to a C function (for built-in types) or a wrapper that calls the Python __add__ method through the interpreter. The subtype-first rule ensures that subclasses can override arithmetic operators and have their version called even when the left operand is the base type.
